NSF PAR Search | NSF Public Access Repository

SuperServe: fine-grained inference serving for unpredictable workloads

Khare, Alind; Garg, Dhruv; Kalra, Sukrit; Grandhi, Snigdha; Stoica, Ion; Tumanov, Alexey (April 2025, USENIX Association)

The increasing deployment of ML models on the critical path of production applications requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving many models under such conditions requires a careful balance between each application's latency and accuracy requirements and the overall efficiency of utilization of scarce resources. Faced with this tension, state-of-the-art systems either choose a single model representing a static point in the latency-accuracy tradeoff space to serve all requests or incur latency target violations by loading specific models on the critical path of request serving. Our work instead resolves this tension through a resource-efficient serving of the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized control-flow operators in pre-trained, weight-shared super-networks. These operators enable SubNetAct to dynamically route a request through the network to actuate a specific model that meets the request's latency and accuracy target. Thus, SubNetAct can serve a vastly higher number of models than prior systems while requiring upto 2.6\texttimes{} lower memory. More crucially, SubNetAct's near-instantaneous actuation of a wide-range of models unlocks the design space of fine-grained, reactive scheduling policies. We design one such extremely effective policy, SlackFit, and instantiate both SubNetAct and Slack-Fit in a real system, SuperServe. On real-world traces derived from a Microsoft workload, SuperServe achieves 4.67\% higher accuracy for the same latency targets and 2.85\texttimes{} higher latency target attainment for the same accuracy.

Free, publicly-accessible full text available April 28, 2026

Search for: All records